Introduction

The data we have chosen to examine is housing prices in California. The dataset comes from Kaggle (https://www.kaggle.com/datasets/fedesoriano/california-housing-prices-data-extra-features) and contains variables that could be used to predict the price of a house in California. We both currently rent (and one of us lives in California), and we hope to purchase a home one day. Understanding this model could help us identify the factors that matter most in predicting price and judge whether a home we are considering is a good deal. Our model will predict Median House Value using candidate predictors such as Median Income and Total Rooms.

Methods

First we need to load the data and prepare some of the columns. To augment the data, we take the predictors Distance to Los Angeles, Distance to San Diego, Distance to San Jose, and Distance to San Francisco and derive a single factor column that records which of these cities is nearest. This segments the data into regions of California and lets us check whether being closer to one city rather than another is relevant.

library(readr)
housing_data = read_csv("California_Houses.csv")
## Rows: 20640 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (14): Median_House_Value, Median_Income, Median_Age, Tot_Rooms, Tot_Bedr...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
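# Label each row with the nearest of the four major cities
nearest_city = rep("", nrow(housing_data))

# Labels are in the same order as the distance columns selected below
nearest_city_options = c("LA", "San Diego", "San Jose", "San Francisco")

for (i in seq_len(nrow(housing_data))) {
  distances = housing_data[i, c("Distance_to_LA", "Distance_to_SanDiego", "Distance_to_SanJose", "Distance_to_SanFrancisco")]

  # Position of the smallest distance picks the matching city label
  nearest_city[i] = nearest_city_options[which.min(unlist(distances))]
}

housing_data$nearest_city = as.factor(nearest_city)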
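
The row-by-row loop above can also be written without a loop. The following is a minimal sketch, not the code used for this report: it assumes the four distance columns shown above and uses a small made-up data frame (`toy`) in place of `housing_data`. Negating the distances lets base R's `max.col()` return the column index of the smallest distance in each row.

```r
# Hypothetical toy data standing in for housing_data (values are made up).
toy = data.frame(
  Distance_to_LA           = c(10, 500, 300, 600),
  Distance_to_SanDiego     = c(150, 700, 20, 800),
  Distance_to_SanJose      = c(480, 60, 600, 30),
  Distance_to_SanFrancisco = c(550, 15, 650, 70)
)

# Labels in the same order as the columns above.
city_labels = c("LA", "San Diego", "San Jose", "San Francisco")

# max.col() finds the largest entry per row, so negate to find the smallest distance.
nearest_idx = max.col(-as.matrix(toy), ties.method = "first")

toy$nearest_city = factor(city_labels[nearest_idx], levels = city_labels)
print(toy$nearest_city)
```

On 20,640 rows this approach avoids subsetting the tibble once per row, which is where most of the loop's time is likely spent.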

Below is a peek at the data, with the newly added column:

head(housing_data)
## # A tibble: 6 × 15
##   Median_House_Value Median_Income Median_Age Tot_Rooms Tot_Bedrooms Population
##                <dbl>         <dbl>      <dbl>     <dbl>        <dbl>      <dbl>
## 1             452600          8.33         41       880          129        322
## 2             358500          8.30         21      7099         1106       2401
## 3             352100          7.26         52      1467          190        496
## 4             341300          5.64         52      1274          235        558
## 5             342200          3.85         52      1627          280        565
## 6             269700          4.04         52       919          213        413
## # … with 9 more variables: Households <dbl>, Latitude <dbl>, Longitude <dbl>,
## #   Distance_to_coast <dbl>, Distance_to_LA <dbl>, Distance_to_SanDiego <dbl>,
## #   Distance_to_SanJose <dbl>, Distance_to_SanFrancisco <dbl>,
## #   nearest_city <fct>

We then set a seed for reproducibility and split the data into a training set (80% of the rows) and a test set (the remaining 20%).

set.seed(420)
housing_data_idx  = sample(nrow(housing_data), size = trunc(0.80 * nrow(housing_data)))
housing_data_trn = housing_data[housing_data_idx, ]
housing_data_tst = housing_data[-housing_data_idx, ]
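
A quick sanity check on a split like this is to confirm that the two pieces are disjoint and together cover every row. The sketch below uses a hypothetical ten-row data frame (`toy`) rather than the housing data, with the same indexing pattern as above.

```r
set.seed(420)

# Hypothetical toy data; the id column makes rows easy to track.
toy = data.frame(id = 1:10, value = (1:10)^2)

# Same pattern as the split above: sample 80% of the row indices for training.
toy_idx = sample(nrow(toy), size = trunc(0.80 * nrow(toy)))
toy_trn = toy[toy_idx, ]
toy_tst = toy[-toy_idx, ]

# Sizes of the two pieces.
print(c(train = nrow(toy_trn), test = nrow(toy_tst)))

# Number of ids that appear in both pieces (should be zero).
overlap = length(intersect(toy_trn$id, toy_tst$id))
print(overlap)
```

The same checks could be applied to `housing_data_trn` and `housing_data_tst` by comparing row counts against `nrow(housing_data)`.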

With the data loaded and prepped, we want to start building the model. Before we do that, we check pairs plots of the variables to see if there are any predictors we need to transform.

library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggpairs(housing_data_trn,
        columns = c(1, 2:5),        # Columns
        aes(color = nearest_city,  # Color by group (cat. variable)
            alpha = 0.5))

ggpairs(housing_data_trn,
        columns = c(1, 6:9),        # Columns
        aes(color = nearest_city,  # Color by group (cat. variable)
            alpha = 0.5))

ggpairs(housing_data_trn,
        columns = c(1, 10:13),        # Columns
        aes(color = nearest_city,  # Color by group (cat. variable)
            alpha = 0.5))

ggpairs(housing_data_trn,
        columns = c(1, 14:15),        # Columns
        aes(color = nearest_city,  # Color by group (cat. variable)
            alpha = 0.5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
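
When a pairs plot shows a curved relationship, one rough way to judge a candidate transformation is to compare the linear correlation with the response before and after transforming. The sketch below is an illustration on made-up data, not a result from the housing data; `compare_log_transform` is a hypothetical helper name, and it assumes the predictor is strictly positive.

```r
# Hypothetical helper: correlation of y with x and with log(x).
compare_log_transform = function(x, y) {
  stopifnot(all(x > 0))  # log() is only defined for positive values
  c(raw = cor(x, y), logged = cor(log(x), y))
}

# Made-up data in which y is exactly linear in log(x).
x = exp(seq(1, 5, length.out = 50))
y = 2 * log(x) + 1

result = compare_log_transform(x, y)
print(round(result, 3))
```

A noticeably higher `logged` value suggests the relationship is closer to linear on the log scale. This is only a heuristic and does not replace inspecting the plots or, later, the model residuals.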

Results

Discussion

Appendix

Group Members

  • Brayden Turner - brturne2
  • Caleb Cimmarrusti - Calebtc2